ABSTRACT

We present a statistical study of correlations and dimensionality of emission lines 
carried out on a sample of over 40,000 SDSS galaxies. Using principal component 
analysis we found that the equivalent widths of the 11 strongest lines can be well 
represented using three parameters. We also explore correlations of emission pattern 
with the eigenspace representation of the continuum spectrum. The observed relations 
are used to provide an empirical prescription for expectation values and variances of 
emission line strengths as a function of spectral shape. We show that this estimation of 
emission lines has a sufficient accuracy to make it suitable for photometric applications. 
The method has already proved useful in photometric redshift estimation.

Subject headings: galaxies: evolution — methods: statistical — techniques: spectro- 
scopic 



1. Introduction 

Galaxy spectral models play an essential role in the interpretation of observational data. It is 
important to be able to characterize models with few parameters as accurately as possible especially 
when working with photometric measurements. Our motivation for exploring the dimensionality 
of the emission pattern and its correlation with the continuum spectrum originates in photometric 
redshift estimation. There are basically two different approaches to determining redshifts of galaxies 
from their multi-band photometric data. One is empirical, for example via fitting the color-
redshift relation (Connolly et al. 1995), nearest neighbors (Budavari 2001), Kd-tree (Csabai
et al. 2003) or artificial neural networks (Firth et al. 2003). The other approach is based on
template spectra. The templates might be either empirical, like Coleman, Wu, & Weedman (1980),
or synthetic, e.g., Bruzual & Charlot (2003) and Fioc & Rocca-Volmerange (1997). The templates
are redshifted and convolved with the filter curves of the survey. The simulated fluxes serve as a 
reference set to match with the photometry of the real objects. The best match gives an estimate 
of the redshift and spectral type of a galaxy. (For an overview, see, e.g., Csabai et al. 2003.) In
many redshift studies synthetic spectra are used, for example in COMBO-17 (Wolf et al. 2004),
Hyperz (Bolzonella et al. 2000), and EAZY (Brammer et al. 2008). Modern surveys with good
signal-to-noise ratio (S/N) and resolution, such as the co-added Stripe 82 of the Sloan Digital Sky
Survey (SDSS), PanSTARRS (Kaiser et al. 2005), or LSST (Tyson et al. 2002), make an even
more general photometric parameter inversion realistic (Budavari 2009). Beyond redshift, one
may address other physical parameters as well, such as star formation history, metallicity, or dust. 
Then an ensemble of model spectra parameterized by a set of such observables is the best starting 
point for calibrating the photometric inversion procedure. Not all quantities of interest will be 
unambiguously visible from the photometric data, but their degeneracies, correlations and errors 
can be well assessed by the calibration. Knowing the propagation of the uncertainties allows one to 
minimize the error of a particular observable and thus optimize the entire method. This concept 
assumes realistic model templates. The templates should account for all features contributing to 
the integrated fluxes, i.e., the spectral continua as well as the strongest emission lines. 

A suitable spectral model should include the radiation of stars, ionized gas, and the effect of 
dust. There are works by Stasinska & Leitherer (1996), Moy et al. (2001), Charlot & Longhetti
(2001), and Panuzzo et al. (2003) that couple these components. Stellar continua are usually modeled
using population synthesis models, e.g., Bruzual & Charlot (2003) or PEGASE (Fioc & Rocca-
Volmerange 1997). Emission lines in star-forming (SF) H II regions are generated by photoionization
codes, e.g., PHOTO (Stasinska 1990) or CLOUDY (Ferland 1996). In general, a particular model is
defined by age, metallicity, star formation history and initial mass function of the underlying stellar 
population. Furthermore chemical composition, density and geometry of the ionized gas as well as 
dust content and certain characteristics of dust also define a model. In order to reduce the number 
of free parameters one usually applies simplifying assumptions and self-consistency constraints, i.e. 
empirical relations between the physical quantities. This makes it possible to describe models
with reasonable accuracy using about three stellar and three gas parameters.

Principal component analysis (PCA) has proved to be a powerful tool in exploring the 
correlations between emission line properties and continuum spectral characteristics. Sodre & 
Stasinska (1999) and Stasinska & Sodre (2001) used it for statistical analysis of spectral features 
of nearby spiral galaxies. They identified the trends of emission line equivalent widths (EWs) as a 
function of spectral type obtained by PCA. 

SDSS provides data suitable for statistical analyses of nebular emission on a large sample of 
galaxies. There have been numerous studies addressing the physical properties of SDSS emission 
line galaxies (Brinchmann et al. 2004; Kauffmann et al. 2003; Tremonti et al. 2004). The aim
of our present study is to elucidate the statistical description of galaxy emission lines. This is
complementary to the previous studies as it allows one to explore, for example, the contamination
of photometric magnitudes by emission lines in an efficient way.

In this paper we perform PCA on emission line EWs of SDSS galaxies in order to find the 
minimal number of independent parameters describing the emission line pattern with a reasonable 
accuracy. We then explore their correlations with the continuum spectrum PCA parameters and 
determine the most probable emission line pattern and its variations as a function of the emission- 
free continuum spectrum. These relations enable us to add emission lines to population synthesis 
model spectra in an empirical way. This prescription was used to improve the empirical spectral 
templates used in SDSS photometric redshift estimation. 

2. Data 

2.1. Description of the SDSS sample 

We studied the emission line data of galaxies selected from the SDSS DR6 database (Adelman- 
McCarthy et al. 2008). The SDSS spectroscopic catalog contains galaxies of all types brighter than 
17.77 mag in the r band (Strauss et al. 2002) and a roughly volume-limited sample of luminous 
red galaxies with redshifts ranging up to z ~ 0.45 (Eisenstein et al. 2001). The spectra are taken 
using 3 arcsec diameter fibers, thus in the lowest redshift galaxies the spectroscopy only samples 
the central region. The data include redshifts, spectral type, measured characteristics of spectral 
lines etc. For a technical overview of SDSS see York et al. (2000). 

The spectral line characteristics published in the spectroscopic catalog are evaluated by an 
automated pipeline. The pipeline is able to identify 48 spectral lines. In order to determine the 
fluxes the lines are fitted by Gaussian profiles. The database lists the fit parameters including 
position, height, width, EW and spectral continuum flux to each line. We use these values for our 
analysis. 

The SDSS spectroscopic catalog also provides quantities that carry useful information on the 
spectral shape of galaxies in a compact form. Connolly et al. (1995) showed that the spectra of 
galaxies form a low-dimensional manifold. Using only a few parameters (say three) the spectra can 
be described with very good precision (99% accuracy) and an objective spectral classification of 
galaxies is also possible. The spectroscopic pipeline applies this dimensional reduction technique 
as detailed in Connolly & Szalay (1999) and Yip et al. (2004) to obtain the parameters ecoeff_i,
i = 0...4, for each galaxy. They are the weights of the first five PCA eigenspectra of the SDSS
galaxies. The derived quantity eclass = atan(-ecoeff_1/ecoeff_0) characterizes the shape of
the spectral continuum very well. Its increasing value corresponds to the rising blue end of the
spectrum and decreasing 4000 Å break, i.e., small/large eclass values indicate early/late types.
(For illustration, see the left panel of Figure 11, which shows the correlation of eclass with the
color u - r.)
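
The definition of eclass can be illustrated with a minimal sketch. The function name and the coefficient values below are made up for illustration; only the formula eclass = atan(-ecoeff_1/ecoeff_0) is taken from the text.

```python
import numpy as np

def eclass_from_ecoeff(ecoeff):
    """Return eclass = atan(-ecoeff_1 / ecoeff_0) for an (N, 5) array of weights."""
    ecoeff = np.atleast_2d(np.asarray(ecoeff, dtype=float))
    return np.arctan(-ecoeff[:, 1] / ecoeff[:, 0])

# Hypothetical expansion coefficients for three galaxies (not SDSS values).
example = np.array([
    [1.00,  0.20, 0.0, 0.0, 0.0],   # gives a negative eclass (early, red type)
    [1.00,  0.00, 0.0, 0.0, 0.0],   # gives eclass = 0
    [1.00, -0.50, 0.0, 0.0, 0.0],   # gives a positive eclass (late, blue type)
])
eclass = eclass_from_ecoeff(example)
```

With these inputs the first object falls on the early-type side and the last on the late-type side of eclass = 0.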

2.2. Sample selection 

The sample for the present study was selected as follows. We selected objects that were
classified by the SDSS photometric pipeline as galaxies. We restricted our investigations to the
strongest 11 emission lines listed in Table 1. In order to include only reliable data we required
the median S/N in the g band to be more than 10 (sn_0 > 10 in the SpecObj data table of the
SkyServer catalog science archive, http://skyserver.sdss.org). For all galaxies included in our
dataset we required that all lines listed in Table 1 are measured (125,832 objects). In order to
exclude measurements with extremely large errors we set an upper cut in the errors of equivalent
widths at 5 Å; this affected 5% of the parent sample. Our typical errors are between 0.2 and 0.4 Å,
depending on the line. The cleanness of the data manifests itself, e.g., in correct ratios of the
known doublet lines. In some cases the line profile
fit does not resolve the broad Hα λ6565 and its two close neighbors [N II] λ6550 and [N II] λ6585
correctly. In order to remove this systematic error, we excluded objects with blended Hα λ6565.
An undesired effect of this sampling is the exclusion of galaxies that have some broad component
around Hα, e.g., Seyfert I galaxies. Our criterion to have all 11 lines reliably measured raises a
concern about completeness in very low and very high metallicity ranges. In the former group [N II]
lines might be unmeasurable, whereas in the latter group [O III] might be very weak or absent. We
address this question later in Section 3.4. We also required sigma < 2 for the [O II] λλ3727,3730 lines, in
order to exclude cases where the fit did not resolve the two lines but captured the blended doublet
instead. We found these particular constraints the most suitable for obtaining both a clean and
representative dataset.
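
The selection criteria above can be sketched as a set of boolean masks. This is only an illustration under stated assumptions: the function name, the array layout, and the use of NaN to mark unmeasured lines are hypothetical and are not taken from the SDSS database schema, and the sigma cut is assumed to apply to both [O II] lines.

```python
import numpy as np

def select_sample(ew_err, sn_g, oii_sigma, halpha_blended,
                  min_sn=10.0, max_ew_err=5.0, max_oii_sigma=2.0):
    """Boolean mask of galaxies passing the quality cuts described in the text.

    ew_err         : (N, 11) EW errors in Angstrom, NaN where a line is unmeasured
    sn_g           : (N,) median S/N in the g band
    oii_sigma      : (N, 2) fitted Gaussian widths of the two [O II] lines
    halpha_blended : (N,) True where the H-alpha fit is blended
    """
    ew_err = np.asarray(ew_err, dtype=float)
    # A line counts as measured when its EW error is a finite number.
    all_measured = np.all(np.isfinite(ew_err), axis=1)
    # NaN is mapped to +inf so that unmeasured lines also fail the error cut.
    small_errors = np.all(np.nan_to_num(ew_err, nan=np.inf) <= max_ew_err, axis=1)
    good_sn = np.asarray(sn_g, dtype=float) > min_sn
    resolved_oii = np.all(np.asarray(oii_sigma, dtype=float) < max_oii_sigma, axis=1)
    unblended = ~np.asarray(halpha_blended, dtype=bool)
    return all_measured & small_errors & good_sn & resolved_oii & unblended

# Four hypothetical galaxies: only the first one passes every cut.
ew_err = np.full((4, 11), 0.3)
ew_err[1, 4] = np.nan      # one line not measured
ew_err[2, 0] = 7.0         # EW error above the 5 Angstrom cut
sn_g = [25.0, 30.0, 18.0, 8.0]   # last object fails the S/N cut
oii_sigma = np.full((4, 2), 1.2)
halpha_blended = [False, False, False, False]
mask = select_sample(ew_err, sn_g, oii_sigma, halpha_blended)
```

Keeping each criterion as a separate mask makes it easy to count how many objects each cut removes.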

Our final sample contains 40,312 galaxies. Since our selection criteria did not include any
constraints on the sign of the emission line fluxes, our sample includes both emission line galaxies
and objects with mainly absorption features. Some characteristics of the sample are shown in
Figure 1, namely the distribution of the data in redshift, absolute magnitude, color and eclass.
Redshift ranges up to z = 0.3, with an average value of z = 0.07. In both color u - r and continuum
type eclass there are two groups of objects visible: red, early types at higher u - r and low eclass
and blue, late types at smaller u - r and larger eclass. As described by Strateva et al. (2001) based
on a study of SDSS galaxies, the two underlying groups of the bimodal distribution in color space
can be separated by a single cut at u - r = 2.22. According to this criterion, 41% of our galaxies are
red and 59% are blue. The corresponding distribution of spectral types is manifested in eclass too.
For more details on the distributions of galaxy spectral types in SDSS see Yip et al. (2004). We can
distinguish between SF galaxies and active galactic nuclei (AGNs) using emission line diagnostics
based on the line ratios N2 = log([N II] λ6585/Hα) and O3 = log([O III] λ5008/Hβ), which were
introduced by Baldwin et al. (1981) (BPT). (For the distribution of our data on the BPT diagram
see Figure 13; the different symbols are explained later in Section 3.4.)
In order to identify the two groups in our sample we adopt the AGN/SF separator of Kauffmann 
et al. (2003) 

O3 = 0.61/(N2 - 0.05) + 1.3. (1)

Objects with O3 values above this line are classified as AGNs, the rest are classified as SFs. Based
on this cut, nearly 50% of the sample are SF galaxies, 18% have an AGN-like emission pattern and
over 32% of the objects cannot be classified as either because at least one of the four
lines of Equation (1), in most cases Hβ, has a non-positive EW. These are objects with weak
overall emission.
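
A minimal sketch of this classification is given below. The function name and input arrays are hypothetical. Treating objects at or beyond the pole of the separator (N2 >= 0.05) as AGN is an assumption of the sketch, not a statement from the text.

```python
import numpy as np

def classify_bpt(nii_6585, halpha, oiii_5008, hbeta):
    """Label objects 'SF', 'AGN' or 'unclassified' with the separator of Equation (1)."""
    lines = np.vstack([nii_6585, halpha, oiii_5008, hbeta]).astype(float)
    # The logarithmic ratios exist only if all four line strengths are positive.
    usable = np.all(lines > 0, axis=0)
    safe = np.where(usable, lines, 1.0)          # placeholder values, masked below
    n2 = np.log10(safe[0] / safe[1])
    o3 = np.log10(safe[2] / safe[3])
    with np.errstate(divide="ignore"):
        curve = 0.61 / (n2 - 0.05) + 1.3
    # Assumption of this sketch: N2 >= 0.05 lies beyond the pole and is taken as AGN.
    is_agn = (n2 >= 0.05) | (o3 > curve)
    labels = np.full(lines.shape[1], "unclassified", dtype=object)
    labels[usable & is_agn] = "AGN"
    labels[usable & ~is_agn] = "SF"
    return labels

# Three hypothetical objects: below the curve, above the curve, non-positive H-beta.
labels = classify_bpt(
    nii_6585=[10 ** -0.5, 10 ** -0.1, 1.0],
    halpha=[1.0, 1.0, 1.0],
    oiii_5008=[10 ** -0.3, 10 ** 0.8, 1.0],
    hbeta=[1.0, 1.0, -0.2],
)
```

The third object stays unclassified because one of its four lines is non-positive, which mirrors the 32% of the sample described above.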

3. Analysis of spectral lines 

3.1. Equivalent width and spectral type distribution 

Figure 2 shows the distribution of the EW values of our sample plotted against the type 
parameter eclass (the smaller the redder). The data show the well-known tendency of emission 
lines becoming stronger from early to late types, see Figure 5 for a few examples. Star formation 
enhances both the blue color of late type galaxies and the strength of the emission lines. (This 
also makes it possible to use certain lines as measures of star formation, see e.g. Kennicutt (1998);
Kewley et al. (2004).) This trend is obvious for all 11 analyzed lines. Early type galaxies located at 
negative eclass values show mainly absorption features and almost no emission independently of 
eclass. This group can be distinguished from late type galaxies situated mostly at positive eclass 
values, whose emission tends to rise with eclass. However, this tendency is not the same for the 
different lines, which originates in the different physics of their formation. We also see galaxies that have
seemingly strong absorption at Hγ λ4342 but not at Hα λ6565, which is puzzling as typically one
expects the absorption EW at Hα λ6565 to be about 60% of that at Hγ λ4342. Since our EWs are
the sum of emission and absorption, this probably means that the Hα λ6565 absorption is filled
in by the emission of these galaxies. This effect is enhanced by measurement uncertainties as well.
Seemingly strong absorption values at [O III] and [S II] are also present due to measurement errors.

3.2. Orthogonal approach 

We try to quantify the common trends and differences in the variation of the nebular emission
pattern using PCA of the EW data. EWs characterize the emission strength physically as they do
not depend on distance and the effect of the Galactic reddening is canceled by normalization. At the
same time, EWs are affected by the intrinsic reddening caused by an inhomogeneous distribution
of star, gas and dust components (Calzetti et al. 1994; Charlot & Fall 2000). Because we are
interested in data as they are observed in photometric measurements, we choose not to correct for
the intrinsic reddening.

PCA is a linear transformation of data vectors to the eigensystem of their correlation matrix.
It results in an uncorrelated representation of the data, and makes it possible to identify the most
relevant directions of variation by ranking them according to their information content.

PCA of our data was carried out as follows. We represented the EWs of each galaxy with an
M = 11 dimensional vector y, with the average ȳ subtracted. Diagonalizing the covariance matrix
we obtained the orthonormal set of M eigenvectors or principal components (PCs) e^k. We ordered
them by their eigenvalues λ^k (normalized to unit trace) since they express the relative information
content of each eigenvector. For each galaxy the expansion coefficients of the vector y form the
new M dimensional vector c:

    y = \sum_{k=1}^{M} c_k e^k .    (2)

The transformation of vector y to vector c corresponds to a simple rotation of the data vectors to
the basis where their correlation matrix is diagonal. Inverting the transformation we obtain the
original vectors again. However, if we truncate the expansion coefficients at some m < M, the data
will not be exactly restored. The effect of omitting the kth principal component is the reduction
of the variance of the truncated EW estimator

    y^{(m)} = \sum_{k=1}^{m} c_k e^k    (3)

by λ^k with respect to the original distribution of the data. Hence, the eigenvalues are actually a
measure of the importance of each eigenvector to reconstructing the real distribution of the data.
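
The procedure of this section can be sketched in a few lines of NumPy. This is not the code used for the analysis: the function names are hypothetical and the input is synthetic data with three dominant directions, chosen only to show the mechanics of Equations (2) and (3).

```python
import numpy as np

def pca_ew(ew):
    """PCA of an (N, M) array of equivalent widths.

    Returns the mean vector, the eigenvalues normalized to unit trace,
    the eigenvectors as rows (e[k] is the k-th PC), and the coefficients c.
    """
    mean = ew.mean(axis=0)
    y = ew - mean                                # subtract the average vector
    cov = np.cov(y, rowvar=False)                # (M, M) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # rank by information content
    eigvals = eigvals[order] / eigvals.sum()     # normalize to unit trace
    e = eigvecs[:, order].T
    c = y @ e.T                                  # c_k = y . e^k
    return mean, eigvals, e, c

def reconstruct(mean, e, c, m):
    """Truncated estimator of Equation (3), with the average added back."""
    return mean + c[:, :m] @ e[:m]

# Synthetic stand-in for the EW table: 500 objects, 11 lines, 3 strong directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3)) * np.array([10.0, 3.0, 1.0])
mixing = rng.normal(size=(3, 11))
ew = latent @ mixing + 0.01 * rng.normal(size=(500, 11))

mean, eigvals, e, c = pca_ew(ew)
ew_three_pc = reconstruct(mean, e, c, m=3)
```

Using all M = 11 components in `reconstruct` restores the input exactly, while m = 3 keeps only the part of the variance carried by the three leading eigenvalues.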



3.3. The principal components 

Figure 4 shows the results of PCA of the EW vectors. The average EWs for each of the 11
lines (vector ȳ) are plotted in the top panel. Below these are the first five eigenvectors e^k ordered
by their eigenvalues. Their information content is 89.1%, 7.8%, 1.8%, 0.7% and 0.2% of the total
variance, respectively. The numerical results are summarized in Table 2. The meaning of each PC
can be easily interpreted by comparing the weight of the ith line in the kth eigenvector with the
mean EW of the ith line, ȳ_i.

The first eigenvector is very similar to the average vector ȳ. This means that the most important
variation in emission line EWs is simply a constant multiplicative factor in the amplitude that
varies from galaxy to galaxy. A larger c_1 indicates stronger nebular emission. The PC is dominated
by the strongest line, Hα λ6565.

The second eigenvector represents mostly the [O III] λλ4960,5008 and nitrogen lines. The
coefficients of [O III] λλ4960,5008 and [N II] λλ6550,6585 have opposite signs. The same holds for
[O II] λλ3727,3730 and [N II] λλ6550,6585 in the eigenvector e^3. This enables the nitrogen emission
lines to vary independently of oxygen in the reconstructed emission-line spectrum. The EW data
(Figure 2) show that [N II] emission grows continuously and slowly from early to late types, while the
oxygen lines have a stronger type dependence, becoming steep especially for the extremely blue
galaxies. Due to the higher ionization degree, the behavior of the [O III] doublet as a function of
type is different from the other lines: while showing no significant emission at low and moderate
eclass values, there is a steep rise at eclass > 0.5. Nitrogen in these components is relatively
strong compared to the Balmer lines, thus both e^2 and e^3 influence the [N II]/Hα λ6565 ratio.

These eigenvectors do not significantly change the ratios of lines in the same doublet since 
their weights normalized by their average values are nearly equal. The constant ratios have physical 
reasons and it is a strong effect which persists in these components. 

In e^4 the two lines of the [O II] λλ3727,3730 doublet are represented with opposite signs. The
effect of this component is to change their ratio. This eigenvector reflects the measurement errors
of the [O II] λλ3727,3730 doublet, which is difficult to deblend at the resolution of SDSS spectroscopy.

Vector e^5 contains mostly [N II] λλ6550,6585, with some weak Balmer and oxygen lines. Thus
we expect that it might influence the precise reconstruction of [N II] lines.

Doublet lines in some higher PCs often appear with opposite sign, which is mainly the effect
of errors. We do not detail the further components: the variance of each is less than 0.2%, and
they are dominated by noise.

3.4. Eigenspace representation of emission line data 

The distribution of the data in the subspace of the first three PCs is shown in Figure 5. The
range occupied by galaxies is approximately a two-dimensional curved manifold. Most of the objects
form a triangular region that is nearly parallel to the axes e^1 and e^2 and nearly perpendicular to
e^3, having small c_3 coordinates. The distribution also has a 'head' at the lower end in c_1 and a 'tail'
having large c_1 values. The galaxies in the 'tail' also have a significant contribution from the third
PC but are still located on the curved surface which is a continuation of the main locus described
above. The fact that PCA overestimates the dimensionality of the data is a known limitation of the
method. It is because PCA is a linear transformation while the physics of emission lines generates
nonlinear structure (see Figure 2).

As shown in Figure 6, the points in the triangular main locus embedded in the subspace
of the first three eigenvectors can be generated by two vectors originating in a point C (Chan
et al. 2003). Their coordinates in (c_1, c_2, c_3) space are: C = (-22,5,0), u = (102,-45,15),
v = (102,15,-15). The origin and the vectors correspond to certain emission patterns recovered
from three PCs according to Equation (3). As the figure shows, the origin has almost no emission.
Thus, since the region is situated approximately in the (e^1,e^2) plane, the EW values in these two
vectors alone can give us some idea of the physical interpretation of the first two PCs.

As indicated in Section 3.3, the overall strength of the emission is manifested in the first
coefficient c_1. We define the relative emission line flux fraction μ as the total emission line flux
normalized by the continuum flux within the range 3728-6733 Å. If we plot this quantity for
each galaxy in the plane (e^1,e^2), we can see the monotonic growth of μ with c_1 (Figure 7). It is
apparent that the gradient of μ is almost parallel to the axis e^1. The inset plot shows a linear
relation between μ and c_1; the 'head' of the distribution has an emission flux fraction of less than
1%, while the largest c_1 galaxies have μ ≈ 0.5, i.e., equal flux contribution from continuum and
emission lines in the analyzed wavelength interval.

Another striking effect in Figure 6 is that while oxygen lines are relatively strong, the nitrogen
emission is weak in vector v; the difference v - u shows negative [N II] lines. This indicates a
difference in metallicity. We estimated the ratio of oxygen and hydrogen abundances from the ratio
of [N II] λ6585 and Hα fluxes using the empirical formula of Pettini & Pagel (2004):

12 + log(O/H) = 8.9 + 0.57 log([N II] λ6585/Hα). (4)

As this estimator is calibrated for H II galaxies we excluded objects dominated by non-thermal
emission from this analysis by requiring log([N II] λ6585/Hα) < -0.3. Figure 8 shows that the
vector v - u points approximately in the direction of the negative metallicity gradient: at a fixed
contribution from vector u the objects tend to have smaller metallicities if the mixing ratio of v
is larger, i.e., upwards in the diagram. The 'head' has large metallicity values which indicate old
galaxies.
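
Equation (4) and the accompanying cut can be written as a short function. This is a sketch under assumptions: the function name and the input fluxes are hypothetical, and returning NaN for excluded objects is a choice made here for illustration.

```python
import numpy as np

def oxygen_abundance_n2(nii_6585, halpha, max_n2=-0.3):
    """Return 12 + log(O/H) from Equation (4); NaN where the estimator is not applied."""
    nii = np.atleast_1d(np.asarray(nii_6585, dtype=float))
    ha = np.atleast_1d(np.asarray(halpha, dtype=float))
    valid = (nii > 0) & (ha > 0)                 # the logarithm needs positive fluxes
    n2 = np.full(nii.shape, np.nan)
    n2[valid] = np.log10(nii[valid] / ha[valid])
    abundance = 8.9 + 0.57 * n2
    # Objects with log([N II]/H-alpha) >= -0.3 are excluded as non-thermal.
    abundance[~(n2 < max_n2)] = np.nan
    return abundance

# Hypothetical fluxes in arbitrary units; the last two objects are excluded.
abundance = oxygen_abundance_n2([0.1, 0.3, 0.8, -0.1], [1.0, 1.0, 1.0, 1.0])
```

For the first object log([N II]/Hα) = -1, so the estimate is 8.9 - 0.57 = 8.33.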

In Figure 6 data are plotted only up to c_1 = 300; however, the 'tail' of the distribution reaches
nearly c_1 = 700. The vector w together with the previous two vectors generates the long tail. Its
endpoint E = (500,250,80) is out of the range of the figure, so only a fifth of the vector, w/5,
is plotted. Point E represents galaxies with extremely strong nebular emission. Since vector w
carries very strong EW values compared to vector v (or u), the spectrum of the point E, as well
as the points near the large c_1 end of the distribution, is very similar to that of vector w. Very
strong Balmer lines and [O III] λλ4960,5008, weak [O II] λλ3727,3730, and nitrogen deficiency can
be observed. Even though our data are not corrected for reddening, the large [O III]/[O II] ratio
is a real, strong effect. It implies a large ionization parameter of the emitting gas, which increases
with c_1 at large c_1 values. In summary, the galaxies of the 'tail' have extremely strong emission,
very low metallicities and high ionization parameters, which indicates that they are young bursting
objects.

Figure 9 shows the color u - r in the subspace of the first and the second PC. A red 'head'
and a blue 'tail' can be seen, and the color becomes continuously bluer toward large c_1 values. This
indicates a close relation of the distribution in the emission line PC space to the bimodality seen
in color. The histograms at the bottom of the diagram indicate the distribution of red and blue
galaxies defined by the u - r = 2.22 cut of Strateva et al. (2001). The transition between the two
distributions is around c_1 ≈ -15 where we have an equal number of red and blue galaxies in our
sample. By a cut at c_1 = -15 we can select 93%/86% of blue/red galaxies, with around 10%/11%
contamination from the other group, respectively.

The correlation of color with c_1 and c_2 indicates that c_1 must be strongly correlated to the continuum
shape of the spectral energy distribution (SED). Figure 10 shows the relation between eclass and
the first principal components. The c_1-eclass relation is very similar to the relation of c_1 and
-(u - r). This is not surprising as they both measure the same effect: the difference between the
intensity of the blue and the red end of the spectrum. Their relation is illustrated in Figure 11
and resembles two linear relations, one for the early type objects and the other for the late type
objects. The u - r = 2.22 cut in color corresponds to a cut at eclass ≈ -0.05. As shown by the
histogram in Figure 11 (bottom), it also roughly corresponds to the inflection point of the eclass
distribution. The color distribution is also shown projected to the right margin of the diagram.
In fact, the separator lines lie somewhat blueward from the inflection points in both eclass and
color, which might be the effect of undersampling of early types by our selection. If we consider
eclass = -0.05 as the separator of early and late spectral types, we can see that the separation
is even slightly clearer than in the case of colors. The cut selects 93%/88% of late/early spectral
types with a fraction of 8%/11% of early/late type objects misclassified by the cut. The correlation
of eclass and the relative emission parameter μ is shown in the right panel of Figure 11. We will
discuss the issue of the emission line PCs and spectral type in more detail later in Section 4.
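
The percentages quoted for cuts such as c_1 = -15 or eclass = -0.05 can be computed as in the sketch below. The definitions used here are our reading of the text and should be taken as assumptions: completeness is the fraction of a class that the cut selects, and contamination is the fraction of the selected objects that belong to the other class. The function name and the toy data are hypothetical.

```python
import numpy as np

def cut_statistics(values, is_target_class, threshold):
    """Completeness and contamination of the selection `values > threshold`."""
    values = np.asarray(values, dtype=float)
    truth = np.asarray(is_target_class, dtype=bool)
    selected = values > threshold
    completeness = np.sum(selected & truth) / np.sum(truth)
    contamination = np.sum(selected & ~truth) / np.sum(selected)
    return completeness, contamination

# Toy data: eight objects with c_1 values and a blue/red label from a color cut.
c1 = [-40.0, -30.0, -20.0, -10.0, 0.0, 20.0, 50.0, 100.0]
is_blue = [False, False, True, False, True, True, True, True]
completeness, contamination = cut_statistics(c1, is_blue, threshold=-15.0)
```

In this toy case the cut selects four of the five blue objects (completeness 0.8), and one of the five selected objects is red (contamination 0.2).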

The absolute magnitude in the r band is shown in Figure 12. Apart from the large scatter, the
objects are generally fainter at larger c_1. At the largest c_1 values only low luminosity objects are
present, with typical M_r ≈ -17. This is in concordance with the earlier studies of SDSS galaxies
by Tremonti et al. (2004) who found that the most metal deficient galaxies are faint.

Figure 13 shows the connection of the first two PCs to the AGN/SF diagnostic diagram. The
line ratios N2 versus O3 are plotted in the left panel, together with the AGN/SF separator line of
Equation (1). The points are colored by the first principal component, using a cut at c_1 = -5 which
appears to be the most reliable c_1 cut separating the AGN from the SF. Red symbols denote c_1 < -5,
blue symbols indicate c_1 > -5. This separator lies blueward of the color or spectral type separator,
as a significant fraction of AGNs has a bluer/later spectral type and has more contribution from
emission lines than do the average red galaxies. Apart from the mixing at the lower edge of the BPT
diagram, the two c_1 regions roughly agree with the AGN/SF separation: the c_1 cut selects 82%/84%
of SFs/AGNs with 6.5%/37% contamination from the other group, respectively. We note that the
c_1 = -5 cut selects 92%/58% of SF/AGN as defined by Stasinska et al. (2006), with 34%/10%
contamination from the other group, respectively. The cut of Stasinska et al. (2006) classifies more
objects as AGN and is therefore more consistent with a higher cut, about c_1 ≈ 0. The separation
of the types is not well defined partly because of the mixing of SF and AGN activity in some low
emission galaxies. In the right panel of Figure 13 AGN (red) and SF galaxies (blue), selected by
Equation (1), are plotted on the c_1-c_2 diagram. The vertical line is at c_1 = -5. The plot shows
the same subset of objects as in the left panel, which means all galaxies where Equation (1) is not
applicable are excluded. They are low emission objects having non-positive Hα, Hβ, [N II] λ6585
or [O III] λ5008. Nearly all of them (99.9%) are at c_1 < -5. Note that low nebular emission
galaxies are underrepresented due to the selection criteria. If present, the missing galaxies would
populate the 'head' too. We also note that AGN with broad Hα emission lines are excluded from
our sample. With our present sample, the 'head' (at c_1 < -5) consists of low emission objects,
31% of which could be classified as AGNs. The plot confirms that the 'tail' consists of SF galaxies.
The classification using c_1 and c_2 is also reminiscent of the (W(Hα), [N II]/Hα) diagram proposed
by Cid Fernandes et al. (2010), which distinguishes among SFs, AGNs and LINER-like galaxies,
since c_1 is closely connected to the Hα EW. However, because of our selection criteria on equivalent
width, we have very few LINER-like galaxies in our sample; these are situated at the lowest c_1
values.

Now we return to the question of missing low and high metallicity objects, rejected because 
of the EW criterion. According to a check carried out in both a low (7.5 - 7.6) and a high (8.6 - 
8.8) metallicity bin, indeed, both ranges are underrepresented by about one-third in our selection, 
if compared to the rejected group. However, there is still a sufficient number of both low and 
high metallicity galaxies in the selected sample for correlation analyses. Although our sample might
not be conclusive for the relative occurrence of the emission patterns, it is still suitable for studying
the relations between emission lines and other observables.

We can conclude that PCA isolates two groups of objects: red, early spectral type, low emission, 
high metallicity, bright galaxies, with a significant fraction of AGNs in the 'head' and the rest of 
the objects which are blue, late spectral type, high emission, lower metallicity, fainter, star forming 
galaxies. In the main locus, there is a gradient of all quantities listed above. All these characteristics 
get continuously more prominent and reach extreme values toward the end of the 'tail'. 

The low effective dimensionality of the emission pattern found by PCA is not surprising. As
already discussed in the Introduction, the reason is the dominance of only a couple of mechanisms.
The relative importance of SF versus AGN has the largest impact on the emission line flux pattern;
the second most important factor is metallicity. These generate the effectively two-
dimensional locus of galaxies in the emission line space.



3.5. Reconstructing spectral lines

We studied the effect of truncation of the principal component basis on the restored EWs. We
examined the convergence of the truncated EW estimator (Equation (3)) as a function of the number
of eigenvectors used for the reconstruction. The error of the estimation can be characterized using
the residuals added in quadrature over all lines

    \Delta y^{(m)} = \sqrt{ \sum_{i=1}^{M} ( y_i^{(m)} - y_i )^2 } .    (5)

Figure 14 shows Δy^(m) averaged over the whole sample as well as for three c_1 bins. Remember
that larger c_1 values mean stronger emission and later spectral type. For the earliest c_1 bin
the contribution of the emission lines is so small that estimating by the average (the m = 0 case) produces
a larger error than ignoring the emission lines, i.e., setting EW = 0 for all lines. We indicate the
error of the estimation by zero EWs for eclass < -0.05 galaxies by an arrow at the left margin of
the diagram. For early spectral types, ignoring the emission would mean a 10 Å error if summed
over all lines.

If we drop all eigenvectors with eigenvalues less than 1%, we are left with the first three PCs. 
Their total percent variance (the m = 3 case) is 98.6%. Using the first three components we can 
reconstruct the total emission with 3 Å average precision. The error is type dependent: its 
absolute value increases with c_1. For the strongest emission bin the average Δy^(3) is 8 Å. 
However, unlike the absolute error, the relative error, defined as 

$$\delta y^{(m)} = \Delta y^{(m)} / |y| \qquad (6)$$

is smaller for stronger emission objects. The average relative error is 25% for the whole sample, 
dominated by the error of objects having small EW values. For the extremely strong emission bin 
it is only 5%. 
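
The convergence test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are mock EWs generated from three latent factors (not the SDSS sample), the function names `pca_basis` and `truncated_errors` are ours, and |y| in Equation (6) is read as the Euclidean norm of the EW vector.

```python
import numpy as np

def pca_basis(Y):
    """Return the mean vector, eigenvectors (as rows) and eigenvalues of the EW covariance."""
    mean = Y.mean(axis=0)
    cov = np.cov(Y - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]           # reorder by decreasing variance
    return mean, eigvec[:, order].T, eigval[order]

def truncated_errors(Y, m):
    """Sample-averaged absolute and relative error of an m-component reconstruction."""
    mean, E, _ = pca_basis(Y)
    C = (Y - mean) @ E[:m].T                   # expansion coefficients c_1..c_m
    Y_m = mean + C @ E[:m]                     # truncated estimator; m = 0 gives the mean
    delta = np.sqrt(((Y_m - Y) ** 2).sum(axis=1))   # residuals added in quadrature
    rel = delta / np.linalg.norm(Y, axis=1)    # assumed reading of |y|
    return delta.mean(), rel.mean()

rng = np.random.default_rng(0)
# Hypothetical mock data: 11 "lines" driven by 3 latent factors plus small noise.
latent = rng.normal(size=(5000, 3)) * np.array([30.0, 10.0, 5.0])
mixing = rng.normal(size=(3, 11))
Y = 20.0 + latent @ mixing + rng.normal(scale=0.5, size=(5000, 11))

for m in range(6):
    d, r = truncated_errors(Y, m)
    print(f"m={m}: mean abs error={d:.2f}, mean relative error={r:.3f}")
```

On such mock data the error drops steeply up to m = 3 and then flattens at the noise level, which is the qualitative behavior expected of a sample with three dominant components.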

We also show the errors of the strongest, most important lines individually. Figure 15 shows 
the convergence of the residuals of the individual EWs of [O II] λ3730, [O III] λ5008, 
[N II] λ6585 and Hα λ6565, 

$$\Delta y_l^{(m)} = \left| y_l^{(m)} - y_l \right| \qquad (7)$$

averaged over the sample. As with the overall emission represented by the summed residuals, all 
the individual lines are reconstructed with an average error not larger than 2 Å using three PCs 
(or not larger than 4 Å using two PCs). For the early type objects at eclass < −0.05, we repeated 
the estimation by zero emission line flux, as in Figure 14. If we set all EWs to zero, the errors of 
the individual lines are still below 5 Å for this group, as shown by the arrows in the left margin 
of the plot. 

We checked the effect of the truncation on physical quantities such as metallicity and emission 
flux fraction when these are determined from three-PC-reconstructed EW data. The relative emission 
line flux fraction μ reconstructed from the first three PCs is plotted against the original in 
Figure 16. The rms error of the reconstruction is as small as 0.001, which means the emission line 
flux fraction is estimated to a precision of 0.1%. The lines reconstructed from the first three PCs 
obey the known ratios of the doublet lines, notably [O III] λλ4960,5008 and [N II] λλ6550,6585, 
with relatively good precision. The line ratio [N II] λ6585/[N II] λ6550 in the reconstruction 
using the first three eigenvectors is 3.26, whereas the fitted ratio in the original data is 3.23. 
For [O III], the fitted [O III] λ5008/[O III] λ4960 ratio is 3.06 for the reconstructed data and 
3.01 for the original data. These features are so strong that only higher PCs begin to violate them 
by including noise components. However, the flux ratios of lines that are not in the same doublet 
are not restored precisely. If we want to use the [N II] λ6585/Hα ratio for diagnostic purposes, to 
distinguish thermal emission of H II regions from AGN-like emission, we need the fifth PC in 
addition to the first three. The fifth eigenvector makes an efficient fine-tuning of this ratio 
possible because it contains Hα and [N II] with opposite signs. Figure 17 shows that metallicity 
estimation with the first three PCs has a relatively large error, which can be suppressed by 
including the fifth eigenvector. The rms errors of the reconstructed metallicity are 0.23 and 0.13 
without and with e_5, respectively. The reconstruction is less precise at high metallicities. This 
is because of the weak Hα and Hβ lines: as described in Section 3.4, these are typically early type 
galaxies. 
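
The text does not say how the doublet ratios were fitted. One simple possibility, shown here purely as an assumption, is a least-squares slope through the origin of the strong-line EW against the weak-line EW. The function name `fitted_ratio` and the mock data (with an assumed intrinsic ratio of 3) are hypothetical.

```python
import numpy as np

def fitted_ratio(strong, weak):
    """Least-squares slope through the origin of strong-line EW vs. weak-line EW."""
    strong = np.asarray(strong, dtype=float)
    weak = np.asarray(weak, dtype=float)
    return float(np.sum(strong * weak) / np.sum(weak * weak))

rng = np.random.default_rng(1)
weak = rng.uniform(1.0, 20.0, size=2000)                  # mock weak-line EWs
strong = 3.0 * weak + rng.normal(scale=0.5, size=2000)    # assumed intrinsic ratio of 3
print(round(fitted_ratio(strong, weak), 3))
```

Applying the same estimator to the original and to the three-PC-reconstructed EWs would give a pair of numbers comparable to the 3.23 vs. 3.26 and 3.01 vs. 3.06 quoted above.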

4. Correlation of spectral lines and continuum features 

The motivation of this study was to explore the connection between the continuum spectral 
type and the emission pattern of the emission line galaxies. Although it is clear that there is no 
one-to-one relation, as a first step we disregard the variations and focus on the systematic trends. 
For practical applications, we would like to make predictions about the emission lines based on 
continuum parameters. These can then be used, e.g., to add emission lines to galaxy model SEDs 
which only contain stellar populations. 

We characterize the continuum spectrum with the three most informative coefficients of the 
spectral principal component expansion, ecoeff_0, ecoeff_1 and ecoeff_2. We investigate their link 
to the first three emission line PCA coefficients c_1, c_2 and c_3, which proved to be essential in 
reconstructing the emission with sufficient accuracy. Figure 18 illustrates how these parameters are 
linked to each other. The data points in all diagrams are colored by ecoeff_0, ecoeff_1 and ecoeff_2. 
For late type objects with significant emission lines a mapping from the first three ecoeff's to the 
first three c_i appears to be possible. 

Each of the first three c_i is plotted against eclass in the top panels of Figure 19. The first 
coefficient c_1 exhibits the strongest correlation with the continuum features. This coefficient 
also has the largest information content. There are apparent systematic trends in c_2 and c_3 as 
well. Given a continuum spectral type, we can determine the expectation values and variances of the 
emission line EWs based on these empirical relations. The continuum PCs ecoeff_0 to ecoeff_2 carry 
even more information that can be used to establish an empirical relation. As indicated in the 
previous section and as one can see in the top panels of Figure 19, early type galaxies might be 
fitted by constant values which would yield nearly zero flux. Here, however, we choose to treat all 
data equally; we have checked that the two approaches do not differ significantly. We fit a second 
order polynomial of the three variables ecoeff_0, ecoeff_1 and ecoeff_2 to each of c_1, c_2 and c_3, 

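$$c_i = \alpha^{(i)} + \sum_{k=0}^{2} \beta_k^{(i)}\, \mathrm{ecoeff}_k + \sum_{k=0}^{2} \sum_{l=k}^{2} \gamma_{kl}^{(i)}\, \mathrm{ecoeff}_k\, \mathrm{ecoeff}_l \qquad (8)$$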

The fitted coefficients are listed in Table 3. We can use this empirical relation to estimate 
emission properties solely from continuum features of the spectra. The residuals of the fit, 
c_i(fit) − c_i, which characterize the goodness of the estimation, are shown in the bottom panels 
of Figure 19. The rms error of the fit residuals is plotted as a function of spectral type as well. 
The scatter originates mainly in the cosmic variance, which includes the effect of geometry 
(Yip et al. (2010)) and other physical parameters not fully covered by the spectral classification 
parameter. The scatter becomes large toward the largest eclass values. However, as the flux values 
themselves are large there, the resulting relative flux error is smaller than for earlier types. 
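
A second order polynomial in three variables has ten terms (one constant, three linear, six quadratic), so the fit reduces to ordinary linear least squares. The sketch below illustrates this on mock data; it is not the authors' fitting code. The function names, the uniform ranges of the mock continuum coefficients and the noise level are assumptions, and the c_1 column of Table 3 is used only as a convenient "true" coefficient vector.

```python
import numpy as np

def design_matrix(e0, e1, e2):
    """Columns: 1, ecoeff_k (k=0..2), ecoeff_k*ecoeff_l (0<=k<=l<=2); ten terms in total."""
    e = np.column_stack([e0, e1, e2])
    cols = [np.ones(len(e))]
    cols += [e[:, k] for k in range(3)]
    cols += [e[:, k] * e[:, l] for k in range(3) for l in range(k, 3)]
    return np.column_stack(cols)

def fit_quadratic(e0, e1, e2, c):
    """Least-squares estimate of (alpha, beta_0..beta_2, gamma_00..gamma_22)."""
    coeffs, *_ = np.linalg.lstsq(design_matrix(e0, e1, e2), c, rcond=None)
    return coeffs

def predict(coeffs, e0, e1, e2):
    """Evaluate the fitted polynomial at the given continuum coefficients."""
    return design_matrix(e0, e1, e2) @ coeffs

# Mock demonstration: the c_1 column of Table 3 serves as the "truth".
true = np.array([1657.808, -4390.826, 488.087, -31.382, 2717.923,
                 -610.956, -153.612, 546.841, 212.847, 433.339])
rng = np.random.default_rng(2)
e0, e1, e2 = rng.uniform(-1.0, 1.0, size=(3, 4000))    # hypothetical ranges
c1 = predict(true, e0, e1, e2) + rng.normal(scale=5.0, size=4000)
fitted = fit_quadratic(e0, e1, e2, c1)
print(np.round(fitted - true, 2))
```

The ordering of the quadratic terms, (0,0), (0,1), (0,2), (1,1), (1,2), (2,2), follows the row order of Table 3, so a tabulated column can be passed to `predict` directly.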

For practical applications, we also list the ecoeff_j fits for four lines (Hα, [O II], [O III] and 
[N II]) in Table 4. For each doublet, we added up the values of the two individual lines. However, 
this approach does not recover flux information in as compact a form as fitting the PCs. While 
Table 3 restores 99.7% of the emission line flux with just three components, Table 4 gives 
only 94.4% of the flux with four fits to a total of seven lines, and does not contain information on 
individual lines except for Hα. 

We use the c_i values fitted with the coefficients of Table 3, denoted c̃_k, to reconstruct the 
emission lines analogously to Equation (3): 

$$\tilde{y} = \sum_{k=1}^{3} \tilde{c}_k\, e_k \qquad (9)$$

We compare the emission lines coming from this estimator with the measured values and calculate 
the errors described in Equations (5)-(7), substituting ỹ for y^(m). The errors of this prediction 
are plotted in Figures 14 and 15 with small arrows at the right margin of each plot. We find that 
the strength of the total nebular emission can be predicted from the spectral continuum with an 
average accuracy of 5 Å, or 40%, for the entire sample. However, for the objects of greatest 
interest, those having significant emission, the relative precision is better. The average errors 
are 10 Å (20%) for the 0 < c_1 < 100 bin, and 25 Å (≈10%) for the strongest emission bin 
(c_1 > 100). Estimating by the average only, without using the continuum dependence, the error can 
be as large as ≈100%. 
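
As an illustration of the predictor, the sketch below combines three coefficients with the first three eigenvectors as tabulated in Table 2. The coefficient values passed in are hypothetical, and the helper name `predict_ews` is ours. Whether a mean EW vector must be added depends on the convention of Equation (3), which is not restated here, so the sketch exposes it as an optional argument rather than assuming either choice.

```python
import numpy as np

LINES = ["[O II] 3727", "[O II] 3730", "Hgamma 4342", "Hbeta 4863",
         "[O III] 4960", "[O III] 5008", "[N II] 6550", "Halpha 6565",
         "[N II] 6585", "[S II] 6718", "[S II] 6733"]

# Rows e_1, e_2, e_3 copied from Table 2 (three-decimal precision).
E = np.array([
    [0.176, 0.225, 0.061, 0.176, 0.124, 0.390, 0.039, 0.821, 0.127, 0.125, 0.091],
    [-0.033, -0.046, -0.012, -0.072, 0.265, 0.805, -0.118, -0.289, -0.380, -0.143, -0.104],
    [-0.518, -0.750, -0.021, -0.012, 0.040, 0.089, 0.072, 0.271, 0.213, -0.159, -0.098],
])

def predict_ews(c, mean=None):
    """Linear combination of the first three eigenvectors; optional mean vector added."""
    y = np.asarray(c, dtype=float) @ E
    if mean is not None:
        y = y + np.asarray(mean, dtype=float)
    return y

c_example = [100.0, 20.0, -5.0]            # hypothetical coefficients, not from the paper
for name, ew in zip(LINES, predict_ews(c_example)):
    print(f"{name:13s} {ew:8.2f}")
```

The tabulated eigenvectors are orthonormal to within rounding, so the same matrix can also be used to project measured EWs back onto c_1, c_2 and c_3.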

We can estimate how well the reconstruction of emission lines works in terms of photometry. 
We simulate photometry by convolving the SDSS filters with the spectra of the objects. We 
investigate the impact of emission lines by omitting them, convolving just the continuum and 
comparing these magnitudes with the values obtained from the entire spectrum (continuum + lines). 
The results for the g, r, and i bands are shown in the first and second rows of Figure 20. The impact 
of the nebular lines is strongly type-dependent. For the strongest emission objects at high eclass, 
the magnitude difference due to the lack of emission lines can reach 0.5 mag in the g band. This is 
the effect of [O III] λλ4960,5008 and [O II] λλ3727,3730. The largest difference in the r and i 
bands is ∼0.2 mag. The significance of the emission line contribution becomes striking when 
compared to the typical photometric uncertainties. (For an r ≈ 19 mag galaxy the photometric data 
have a typical error of 0.03 mag, 0.025 mag and 0.07 mag in the g, r and i bands, respectively.) 
Note the redshift dependence due to lines being redshifted into and out of the filters. It is most 
apparent in the r band: the low redshift hump comes mainly from Hα and the high redshift hump from 
[O III] λλ4960,5008, which are then redshifted into the i band. For the bottom three plots we 
estimated the emission lines from ecoeff_j, using the fitted c_j values and Equation (9). We carried 
out simulated photometry with continuum + predicted lines, and compared this with photometry 
simulated with real emission lines. The results show that the prediction approximates the original 
values with a maximum error of ≲ 0.1 mag in g for the strongest emission objects and 0.05 mag for 
r and i. The rms error is of the order of 0.01 mag for the extremely high emission bin eclass > 0.6 
and ≈ 0.001 mag for eclass < 0.6. (We note that the early types at eclass < −0.05 have errors not 
larger than this even if no emission line flux is added.) This precision is sufficient for most 
photometric applications. For example, one can add spectral lines to model SEDs of stellar 
populations or any empirical spectra with missing emission lines in a way that makes the continuum 
features and the emission pattern consistent with the observations. 
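
The size of the effect can be illustrated with a toy calculation. The sketch below is a simplification under explicit assumptions: a flat mock continuum, a single Gaussian line, a hypothetical top-hat band instead of the real SDSS filter curves, and a plain energy-weighted integral of the flux density. All names and numbers are ours.

```python
import numpy as np

def integrate(y, x):
    """Trapezoidal rule, written out so it does not depend on the NumPy version."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def band_flux(wave, flux, throughput):
    """Flux collected through a band: integral of flux density times throughput."""
    return integrate(flux * throughput, wave)

def magnitude_offset(wave, continuum, lines, throughput):
    """m(continuum only) - m(continuum + lines); positive when the lines add flux."""
    f_cont = band_flux(wave, continuum, throughput)
    f_tot = band_flux(wave, continuum + lines, throughput)
    return 2.5 * np.log10(f_tot / f_cont)

wave = np.arange(4000.0, 6000.0, 0.1)                  # wavelength grid in Angstrom
continuum = np.ones_like(wave)                         # flat mock continuum
throughput = ((wave > 4500.0) & (wave < 5500.0)).astype(float)   # hypothetical band
ew, center, sigma = 50.0, 5008.24, 3.0                 # mock line: EW, position, width
lines = ew * np.exp(-0.5 * ((wave - center) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

dm = magnitude_offset(wave, continuum, lines, throughput)
print(f"magnitude offset: {dm:.4f} mag")
```

For a flat continuum and a top-hat band of width W that fully contains the line, the offset reduces to 2.5 log10(1 + EW/W), which gives a quick order-of-magnitude check for any line and filter combination.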

As an example of practical applications, a similar method has been successfully applied to 
improve the spectral templates used in SDSS photometric redshift estimation. In SDSS, we used 
a hybrid photometric redshift (photo-z) algorithm that combines the advantages of the template 
fitting and empirical approaches. It uses semi-empirical spectral templates based on 
the Coleman, Wu, & Weedman (1980) (CWW) empirical spectra, extended at the UV and IR ends 
with Bruzual-Charlot model spectra. These are, in fact, the most widely used 
templates in empirical template fitting photo-z applications. In our approach, the templates 
are iteratively trained on a reference set of galaxies with known redshifts and photometry, so that 
they better match the photometry of the SDSS data (Budavari (2000)). The quality of photo-z 
estimation clearly improves with the use of repaired templates. However, we see instabilities during 
the iterative procedure around the positions where the analysis in this paper would predict a strong 
line. We attribute this effect to the discrepancy between the continuum-line correlations of the CWW 
templates and the SDSS data sample. To correct for this, we used a modified algorithm, in which the 
emission lines in the CWW spectra are replaced using the empirical continuum-line correlations 
discussed in this study. This modified method was used in SDSS data releases DR4, DR5 
and DR6 (Adelman-McCarthy et al. (2006, 2007, 2008)), where this ingredient reduced the 
error on the photo-z estimate by 10% for the bluest galaxies. In Table 5, we show the effect of 
this procedure on the photo-z results in detail. The photo-z type t parameterizes the spectral type, 
with 0 denoting the reddest galaxies and 55 denoting the bluest galaxies. We list the rms photo-z 
error as a function of type for the cases when the procedure is carried out using the original or 
the modified CWW spectra. While the estimate for the early type objects remains unchanged, as 
expected, the improvement increases with type. Beyond this, we consider the stabilization 
of the training procedure to be the main achievement. 
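
The relative improvement per type follows directly from the rms values in Table 5. The short calculation below uses only those tabulated numbers; the helper name `relative_improvement` is ours.

```python
# rms photo-z errors from Table 5: (original templates, templates with replaced lines)
rms = {
    "all (t = 0-55)": (0.0694, 0.0680),
    "red (t = 0-20)": (0.0557, 0.0556),
    "blue (t = 21-55)": (0.0876, 0.0843),
    "bluest (t = 50-55)": (0.1512, 0.1374),
}

def relative_improvement(original, modified):
    """Fractional reduction of the rms error."""
    return (original - modified) / original

for label, (orig_err, new_err) in rms.items():
    print(f"{label:20s} {100.0 * relative_improvement(orig_err, new_err):5.1f}%")
```

For the bluest bin this gives about 9%, consistent with the roughly 10% quoted in the text, while the red bin changes by well under 1%.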

We have shown how the relations in Figure 19 can be used to predict the expectation 
values of EWs based on the spectral type. The variance may serve as additional information when 
a simulated distribution of the emission pattern is generated. 
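
One way to use the variance when simulating an emission pattern is to draw the PCA coefficients around their fitted expectation values. The sketch below assumes independent Gaussian scatter in each coefficient, which the text does not state, and the numerical values are placeholders rather than measured quantities.

```python
import numpy as np

def sample_coefficients(mean_c, rms_c, n, seed=None):
    """Draw n sets of (c_1, c_2, c_3) with independent Gaussian scatter (an assumption)."""
    rng = np.random.default_rng(seed)
    mean_c = np.asarray(mean_c, dtype=float)
    rms_c = np.asarray(rms_c, dtype=float)
    return rng.normal(loc=mean_c, scale=rms_c, size=(n, mean_c.size))

# Placeholder expectation values and rms scatter for one spectral type.
draws = sample_coefficients([50.0, -5.0, 2.0], [10.0, 5.0, 3.0], n=10000, seed=3)
print(draws.mean(axis=0).round(1), draws.std(axis=0).round(1))
```

Each drawn coefficient set can then be turned into a set of EWs with the eigenvector combination of Equation (9).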

5. Conclusions 

We considered a sample of over 40,000 SDSS galaxy spectra (emission lines and continua), 
coming mostly from galaxy cores due to the aperture effect of the survey. Using PCA of the EWs of 
the 11 selected emission lines, we found that nearly 99% of the information is contained in the 
subspace generated by the first three eigenvectors. They reconstruct emission line fluxes to a 
precision of 5%-25%, depending on spectral type. We found that, based on a three-dimensional 
eigenspace representation of continuum spectra, there is a simple way of estimating the most 
probable emission line pattern and its variation, which makes it possible to determine the total 
photometry from the continuum spectrum in the investigated bands with a precision < 0.1 mag. The 
applications include the comparison of photometric observations with models, e.g., determining 
K-corrections and absolute magnitudes. The prescription for adding lines to template spectra has 
been successfully applied to improve the precision of photometric redshift estimation. 

The authors acknowledge support from the following grants: OTKA-MB08A-80177, 
MRTN-CT-2004-503929, NKTH: RET14/2005, KCKHA005, and Polanyi. 

Funding for the creation and distribution of the SDSS Archive has been provided by the 
Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space 
Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese 
Monbukagakusho, and the Max Planck Society. The SDSS Web site is http://www.sdss.org/. 
The SDSS is managed by the Astrophysical Research Consortium (ARC) for the Participating 
Institutions. The Participating Institutions are The University of Chicago, Fermilab, the Institute 
for Advanced Study, the Japan Participation Group, The Johns Hopkins University, Los Alamos 
National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute 
for Astrophysics (MPA), New Mexico State University, Princeton University, the United States 
Naval Observatory, and the University of Washington. 






REFERENCES 

Adelman-McCarthy, J. K., et al. 2006, ApJS, 162, 38 
Adelman-McCarthy, J. K., et al. 2007, ApJS, 172, 634 
Adelman-McCarthy, J. K., et al. 2008, ApJS, 175, 297 
Baldwin, J. A., Phillips, M. M., & Terlevich, R. 1981, PASP, 93, 5 
Bolzonella, M., Miralles, J.-M., & Pello, R. 2000, A&A, 363, 476-492 
Brammer, G. B., van Dokkum, P. G., & Coppi, P. 2008, ApJ, 686, 1503 
Brinchmann, J., Charlot, S., White, S. D. M., Tremonti, C., Kauffmann, G., Heckman, T., & 
Brinkmann, J. 2004, MNRAS, 351, 1151 
Bruzual, G., & Charlot, S. 2003, MNRAS, 344, 1000 
Budavari, T., Szalay, A. S., Connolly, A. J., Csabai, I., & Dickinson, M. 2000, AJ, 120, 1588 
Budavari, T., et al. 2001, AJ, 122, 1163 
Budavari, T. 2009, ApJ, 695, 747 
Calzetti, D., Kinney, A. L., & Storchi-Bergmann, T. 1994, ApJ, 429, 582 
Chan, B. H. P., Mitchell, D. A., & Cram, L. E. 2003, MNRAS, 338, 790 
Charlot, S., & Fall, S. M. 2000, ApJ, 539, 718 
Charlot, S., & Longhetti, M. 2001, MNRAS, 323, 887 
Coleman, G. D., Wu, C.-C., & Weedman, D. W. 1980, ApJS, 43, 393 
Connolly, A. J., Szalay, A. S., Bershady, M. A., Kinney, A. L., & Calzetti, D. 1995, AJ, 110, 1071 
Connolly, A. J., & Szalay, A. S. 1999, AJ, 117, 2052 
Csabai, I., et al. 2003, AJ, 125, 580 
Eisenstein, D., et al. 2001, AJ, 122, 2267-2280 
Ferland, G. J. 1996, Hazy, a Brief Introduction to CLOUDY (Lexington, KY: Univ. of Kentucky 
Internal Report), 565 
Cid Fernandes, R., Stasinska, G., Schlickmann, M. S., Mateus, A., Vale Asari, N., Schoenell, W., & 
Sodre, L. 2010, MNRAS, 403, 1036 
Fioc, M., & Rocca-Volmerange, B. 1997, A&A, 326, 950 






Firth, A. E., Lahav, O., & Somerville, R. S. 2003, MNRAS, 339, 1195 
Kaiser, N., and PanSTARRS Team 2005, BAAS, 37, 1409 
Kauffmann, G., et al. 2003, MNRAS, 346, 1055 
Kennicutt, R. C., Jr. 1998, ARA&A, 36, 189 
Kewley, L. J., Geller, M. J., & Jansen, R. A. 2004, AJ, 127, 2002 
Moy, E., Rocca-Volmerange, B., & Fioc, M. 2001, A&A, 365, 347 
Panuzzo, P., Bressan, A., Granato, G. L., Silva, L., & Danese, L. 2003, A&A, 409, 99 
Pettini, M., & Pagel, B. E. J. 2004, MNRAS, 348, 59 
Sodre, L., & Stasinska, G. 1999, A&A, 345, 391 
Stasinska, G., & Leitherer, C. 1996, ApJS, 107, 66 
Stasinska, G. 1990, A&AS, 83, 501 
Stasinska, G., & Sodre, L. 2001, A&A, 374, 919 
Stasinska, G., Cid Fernandes, R., Mateus, A., Sodre, L., & Asari, N. V. 2006, MNRAS, 317, 972 
Strateva, I., et al. 2001, AJ, 122, 1861 
Strauss, M., et al. 2002, AJ, 124, 1810-1824 
Tremonti, C., et al. 2004, ApJ, 613, 898 
Tyson, J. A. 2002, Proc. SPIE, 4836, 10 
Wolf, C., et al. 2004, A&A, 421, 913 
Yip, C.-W., et al. 2004, AJ, 128, 585 
Yip, C.-W., Szalay, A. S., Wyse, R. F. G., Dobos, L., Budavari, T., & Csabai, I. 2010, ApJ, 709, 780 
York, D., et al. 2000, AJ, 120, 1579 



This preprint was prepared with the AAS LaTeX macros v5.2. 






Fig. 1. Distribution of absolute magnitude (left), color u − r (middle) and the continuum spectral 
type parameter eclass (right) vs. redshift, shown for a 10% random subset of our sample. Two 
distinct groups of galaxies can be identified in both color and spectral type. 

Fig. 2. EW - eclass distribution of the selected 11 spectral lines. Galaxy counts are plotted in 
grayscale. The spectral type parameter eclass (x axis) is small/large for early/late type galaxies. 
The EWs (in Å, y axis) of all lines show a strong type dependence. The absorption dominated early 
type objects are situated at negative eclass values. At positive eclass the EWs of the emission 
line galaxies increase with type. 

Fig. 3. Composite galaxy spectra from early type with no emission (left) to emission rich late 
type (right). 

Fig. 4. Mean vector ȳ and the first five eigenvectors of EWs. For each eigenvector, λ denotes 
the relative information content. See explanation in Section 3.3. 

Fig. 5. Distribution of the emission line galaxies in the subspace of the first and second (left) 
and the first and third (right) principal components. PCA shows that the data form a roughly 
two-dimensional manifold in the 11-dimensional EW space. The inset plots show the low c_1 region 
zoomed in: the distribution separates into a 'head' and a 'tail'. 

Fig. 6. EW data projected onto the subspace of the first two principal components. Inset plots show 
3D-reconstructed EWs corresponding to the origin C and some representative directions. The two 
vectors u and v generate the main locus occupied by the majority of emission line galaxies. Their 
difference v − u contributes to the spectrum by enhancing oxygen while depressing nitrogen lines 
when going in the direction from u to v. The vector w together with the previous two vectors 
generates the strongest emission spectra. It ends outside the range of this figure, at the point 
E = (500, 250, 80); the vector w/5 shown has the same direction and one fifth the length of w. The 
spectrum of point E (not shown) is very similar to that of vector w. Very strong Balmer lines and 
[O III] λλ4960,5008, weak [O II] λλ3727,3730, as well as nitrogen deficiency can be observed. 

Fig. 7. Variation of the relative emission line flux μ averaged over bins (grayscale) shows a linear 
relation with c_1 (inset plot). 

Fig. 8. Variation of metallicity with the first two PCA coefficients. The quantity 12 + log(O/H) 
estimated using Equation (4), averaged over bins, is plotted in grayscale. The metallicity decreases 
in the direction of vector Cr, which is close to vector u − v of Figure 6. The inset plot shows the 
distribution of metallicity along Cr. 

Fig. 9. Distribution of u − r color in the plane (e_1, e_2). Main plot: color averaged over pixels 
(u − r color-coded: red means u − r > 2.22, blue u − r < 2.22, gray is the transition between them). 
The lowest c_1 values are dominated by red objects; blue becomes dominant at higher c_1 values. 
The transition from red to blue types (u − r = 2.22) is around c_1 = −20. Inset plot: u − r vs. c_1 
is a monotonic relation: the higher c_1, the bluer the objects. However, low c_1 ranges have mixed 
colors, and the relation gets tighter at higher c_1. Histograms at the bottom: c_1 distribution of 
the two color types. The red (u − r > 2.22) subset is plotted with a red dashed line, the blue 
(u − r < 2.22) subset with a blue solid line. 

Fig. 10. Distribution of the spectral type in the plane (e_1, e_2). Main plot: eclass averaged over 
pixels. Inset plot: eclass vs. c_1. We can see the same tendency for eclass as for color in Figure 
9. Early types (negative eclass) are at negative c_1 values, late types (positive eclass) are at 
higher c_1. Histograms at the bottom: c_1 distribution of the two spectral type bins. Red dashed 
line: eclass < −0.05, blue solid line: eclass > −0.05. 

Fig. 11. Left panel: color u − r vs. spectral type eclass. The relation is nearly linear, with 
apparent bimodality. The u − r = 2.22 separator of blue and red types (vertical dotted line) 
corresponds to eclass ≈ −0.05 (horizontal dotted line). The histogram of the eclass distribution is 
plotted on the x axis. The distribution of u − r is projected to the right margin. Right panel: 
connection between spectral type and the relative emission parameter μ. 

Fig. 12. r-band absolute magnitude in the plane of the first two PCs. Main plot: M_r averaged over 
bins (grayscale) in the plane (e_1, e_2). Inset plot: M_r vs. c_1. On average, luminosity decreases 
with increasing c_1; however, the scatter is large. The strongest nebular emission objects are the 
faintest ones, with M_r ∼ −17. 

Fig. 13. Left: N2:O3 diagnostic diagram for distinguishing between star forming galaxies and 
AGN. The dotted line shows the AGN separator of Equation (1). The two types of symbols are 
selected by a c_1 = −5 cut. Right: AGN (red) and SF (blue) defined by the separator in the left 
panel. AGN are situated at c_1 < 0. A part of the 'head' is missing because of some negative values 
among the EWs. In both panels, the scatter plots show a 5% random subsample. 




Fig. 14. Convergence of the spectral line reconstruction. The error of the reconstruction is 
plotted as a function of the truncation limit. m: number of eigencomponents kept; m = 0 represents 
the reconstruction using the mean EW for each line, with no PCs. Left: EW residuals summed over all 
lines. Right: relative error. The results are shown for all galaxies (solid thick line) and for 
three c_1 bins; larger c_1 values indicate stronger nebular emission. All EWs can be well 
reconstructed using the first three eigencomponents. The single arrow at the left margin of the left 
panel denotes the error of ignoring the emission line flux in eclass < −0.05 early type objects; see 
explanation in the text. The arrows at the right margin show the same quantities for the prediction 
made from continuum expansion coefficients; see explanation in Section 4. 

Fig. 15. Reconstruction of four selected emission lines using m eigencomponents. Δy_l: 
sample-averaged absolute EW error; m: same as in Figure 14. The arrows at the left margin show the 
error of estimation by zero emission line flux for the eclass < −0.05 subset. The arrows at the 
right margin show the same quantity for the prediction made from continuum expansion coefficients; 
see explanation in Section 4. 

Fig. 16. Reconstruction of the relative emission strength using a truncated eigenbasis. The rms 
error is 0.001. 

Fig. 17. Reconstruction of metallicity using a truncated eigenbasis. With the first three 
eigenvectors, the error is relatively large. Adding the fifth eigenvector suppresses the error 
significantly. 


Table 1. Analyzed Emission Lines 

Line name         Rest wavelength [Å] 
[O II] λ3727      3727.09 
[O II] λ3730      3729.88 
Hγ λ4342          4341.68 
Hβ λ4863          4862.68 
[O III] λ4960     4960.29 
[O III] λ5008     5008.24 
[N II] λ6550      6549.86 
Hα λ6565          6564.61 
[N II] λ6585      6585.27 
[S II] λ6718      6718.29 
[S II] λ6733      6732.67 

Note. Reference SDSS table; see also the SpecLineName table in the catalog science archive. 

Fig. 18. Data points plotted in the planes ecoeff_1:ecoeff_2 (top left), ecoeff_1:ecoeff_3 (top 
right), c_1:c_2 (bottom left) and c_1:c_3 (bottom right). The coloring is made by rgb-coding of 
ecoeff_1 (green), ecoeff_2 (blue) and ecoeff_3 (red). The same coloring is applied to the emission 
line PCA subspaces. By matching the points of the same colors in the various plots we can see how 
the ecoeff_j regions are mapped into the c_i space. The image shows that (at least for 
ecoeff_1 < 0.1, which means positive eclass, later type objects) there is a mapping. Early types are 
not resolved; however, they are located in the 'head' of the distribution, having weak emission 
lines. 

Fig. 19. Top: PC coefficients as a function of eclass. Bottom: the residuals of the first three 
emission line PC coefficients after subtracting the continuum fit, as a function of the continuum 
type ecoeff. An empirical connection between the continuum spectral shape and the nebular 
emission pattern can be established from the observed correlations. The average residual scatter 
after subtracting the fitted estimator is plotted with blue dotted lines as a function of the 
spectral type. 

Fig. 20. Difference in g, r and i magnitude if the photometry is made without emission lines, as a 
function of spectral type (first row) and redshift (second row). Late spectral types at larger 
eclass, having stronger nebular emission, have larger errors. The redshift dependence shows a 
pattern caused by lines being redshifted into and out of the filters. Third row: magnitude errors of 
the simulated photometry with emission lines predicted from the continuum fit, as a function of 
spectral type. 

Table 2. The first five eigenvectors 

Line name         e_1        e_2        e_3        e_4        e_5 
                  (0.891)    (0.078)    (0.018)    (0.007)    (0.002) 
[O II] λ3727       0.176     -0.033     -0.518     -0.819     -0.095 
[O II] λ3730       0.225     -0.046     -0.750      0.574     -0.148 
Hγ λ4342           0.061     -0.012     -0.021      0.000      0.171 
Hβ λ4863           0.176     -0.072     -0.012      0.001     -0.031 
[O III] λ4960      0.124      0.265      0.040      0.001     -0.070 
[O III] λ5008      0.390      0.805      0.089      0.002     -0.278 
[N II] λ6550       0.039     -0.118      0.072     -0.001     -0.254 
Hα λ6565           0.821     -0.289      0.271      0.018      0.315 
[N II] λ6585       0.127     -0.380      0.213     -0.003     -0.822 
[S II] λ6718       0.125     -0.143     -0.159     -0.004      0.127 
[S II] λ6733       0.091     -0.104     -0.098     -0.008      0.017 

Note. The first five EW principal components ordered by their relative 
information content. The eigenvalue of each eigenvector is given in round 
brackets in the column header. 

Table 3. Fitted polynomial coefficients of c_i 

Coefficient       c_1          c_2          c_3 
α              1657.808      674.528      508.194 
β_0           -4390.826     -191.740     -639.650 
β_1             488.087      108.530      -27.376 
β_2             -31.382     -150.234      -16.725 
γ_00           2717.923     -481.908      132.141 
γ_01           -610.956      -59.598       25.392 
γ_02           -153.612      226.695       -2.644 
γ_11            546.841     -498.071     -245.997 
γ_12            212.847       82.486      -36.546 
γ_22            433.339     -617.698     -375.677 

Note. Fitted polynomial coefficients of Equation (8) for c_1, c_2 and c_3 as a function 
of ecoeff_j. 

Table 4. Fitted polynomial coefficients of selected lines 

Coefficient       Hα              [O III]          [O II]           [N II] 
α              1178.370503     2655.215341      308.461569      -61.094836 
β_0           -3486.511250    -4071.111890    -1406.454947     -443.963678 
β_1             281.981021     1477.873084      625.325629     -215.602859 
β_2            -100.191558      -36.292149       38.422858      -15.750437 
γ_00           2315.964983     1415.882134     1103.062075      510.520196 
γ_01           -398.852773    -1443.357079     -664.083788      153.924612 
γ_02            -86.340819       40.335051      -90.761428      -77.564287 
γ_11            550.149412     -229.362689      579.736300      149.155350 
γ_12             80.078273      309.940485      143.727637      -85.912611 
γ_22            454.181171     -646.157142      663.971699      210.729024 

Note. Fitted polynomial coefficients of Equation (8) for Hα, [O III], 
[O II] and [N II] as a function of ecoeff_j. 

Table 5. The rms error of photo-z 

Sample     t        original    added lines 
all        0-55     0.0694      0.0680 
red        0-20     0.0557      0.0556 
blue       21-55    0.0876      0.0843 
bluest     50-55    0.1512      0.1374 

Note. The rms error of photo-z for different galaxy types, for two different 
template sets. Column 3 lists the error of the redshift estimate when the original 
CWW templates are trained and used as templates. Column 4 shows the results 
in the case when the emission lines are replaced before the training. The type 
parameter t goes from 0 to 55, 0 being the reddest and 55 the bluest type. 



